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Gaussian Processes for Regression: 
A Quick Introduction 

M. Ebden, August 2008 
Comments to mebden@gmail.com 


1 Motivation 


Figure [Tjillustrates a typical example of a prediction problem: given some noisy obser¬ 
vations of a dependent variable at certain values of the independent variable x, what is 
our best estimate of the dependent variable at a new value, X* ? 

If we expect the underlying function f(x) to be linear, and can make some as¬ 
sumptions about the input data, we might use a least-squares method to fit a straight 
line (linear regression). Moreover, if we suspect f{x) may also be quadratic, cubic, or 
even nonpolynomial, we can use the principles of model selection to choose among the 
various possibilities. 

Gaussian process regression (GPR) is an even finer approach than this. Rather 
than claiming f(x) relates to some specific models (e.g. f(x) = mx + c), a Gaussian 
process can represent f(x) obliquely, but rigorously, by letting the data ‘speak’ more 
clearly for themselves. GPR is still a form of supervised learning, but the training data 
are harnessed in a subtler way. 

As such, GPR is a less ‘parametric’ tool. However, it’s not completely free-form, 
and if we’re unwilling to make even basic assumptions about f(x), then more gen¬ 
eral techniques should be considered, including those underpinned by the principle of 
maximum entropy; Chapter 6 of Sivia and Skilling (j2006j) offers an introduction. 


1.5 


1 

0.5 

0 


>, 

- 0.5 


- 1.5 



? 


- 2.5 


- 1.6 - 1.4 - 1.2 -1 


- 0.8 - 0.6 

x 


- 0.4 - 0.2 


0.2 


Figure 1: Given six noisy data points (error bars are indicated with vertical lines), we 
are interested in estimating a seventh at x* = 0.2. 
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2 Definition of a Gaussian Process 


Gaussian processes (GPs) extend multivariate Gaussian distributions to infinite dimen¬ 
sionality. Formally, a Gaussian process generates data located throughout some domain 
such that any finite subset of the range follows a multivariate Gaussian distribution. 
Now, the n observations in an arbitrary data set, y = {yi,... ,y n }, can always be 
imagined as a single point sampled from some multivariate fn-variate) Gaussian distri¬ 
bution, after enough thought. Hence, working backwards, this data set can be partnered 
with a GP. Thus GPs are as universal as they are simple. 

Very often, it’s assumed that the mean of this partner GP is zero everywhere. What 
relates one observation to another in such cases is just the covariance function, k{x, x'). 
A popular choice is the ‘squared exponential’. 


k(x, x') = a 2 exp 


—(x — x') 2 

2l 2 


(1) 


where the maximum allowable covariance is defined as a 2 — this should be high for 
functions which cover a broad range on the y axis. If x ~ x ', then k(x, x') approaches 
this maximum, meaning f(x) is nearly perfectly correlated with fix'). This is good: 
for our function to look smooth, neighbours must be alike. Now if x is distant from 
x', we have instead k(x,x') ss 0, i.e. the two points cannot ‘see’ each other. So, for 
example, during interpolation at new x values, distant observations will have negligible 
effect. How much effect this separation has will depend on the length parameter, l, so 
there is much flexibility built into (|T]». 

Not quite enough flexibility though: the data are often noisy as well, from measure¬ 
ment errors and so on. Each observation y can be thought of as related to an underlying 
function /( x) through a Gaussian noise model: 

V = f(x) +Af(0,al), (2) 


something which should look familiar to those who’ve done regression before. Regres¬ 
sion is the search for /( x). Purely for simplicity of exposition in the next page, we take 
the novel approach of folding the noise into k(x, x'), by writing 


k(x, x') 


7 f exp 


— (x — x') 


l\21 


2 1 2 


a 2 6(x,x'), 


(3) 


where S(x, x') is the Rronecker delta function. (When most people use Gaussian pro¬ 
cesses, they keep cr n separate from k{x,x'). However, our redefinition of k(x,x') is 
equally suitable for working with problems of the sort posed in Figure [T] So, given n 
observations y, our objective is to predict j/*, not the ‘actual’ /*; their expected values 
are identical according to 0 - but their variances differ owing to the observational noise 
process, e.g. in Figure[T| the expected value of y*, and of /*, is the dot at x*.) 

To prepare for GPR, we calculate the covariance function, 0, among all possible 
combinations of these points, summarizing our findings in three matrices: 


K = 


fe(xi,Xi) 

k(x i,x 2 ) ••• 

fc(xi,x„) 



fc(x 2 ,Xt) 

k(x 2 ,x 2 ) ••• 

k(x 2 ,X„) 


(4) 

fc(x„,xi) 

k{x n ,x 2 ) ••• 

k{x n , x n f 



X»,x 2 ) • 

•• fe(x*,x„)] 

-/^5)C5)S 

= fc(x*, x»). 

(5) 


K x = [/c(x*,xi) k(a 

tifirm for yourself t 
off-diagonal elements tend to zero when x spans a large enough domain. 


Confirm for yourself that the diagonal elements of K are o 2 + a 2 , and that its extreme 
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3 How to Regress using Gaussian Processes 


Since the key assumption in GP modelling is that our data can be represented as a 
sample from a multivariate Gaussian distribution, we have that 


y 

y* 




( 6 ) 


where T indicates matrix transposition. We are of course interested in the conditional 
probability p(y*\y): “given the data, how likely is a certain prediction for y*?”. As 
explained more slowly in the Appendix, the probability follows a Gaussian distribution: 

y* |y ~ N{KJ<~ l y, K,, - KJ<~ l Kj). (7) 


Our best estimate for y, is the mean of this distribution: 

y* = AT*AT _1 y, 


( 8 ) 


and the uncertainty in our estimate is captured in its variance: 

var(y*) = K„ - K,K^ I<J. (9) 

We’re now ready to tackle the data in Figure[T| 

1. There are n = 6 observations y, at 

x = [—1.50 -1.00 -0.75 -0.40 -0.25 0.00] . 

We know a n = 0.3 from the error bars. Withjudicious choices of <jf and l (more 
on this later), we have enough to calculate a covariance matrix using Q: 



'1.70 

1.42 

1.21 

0.87 

0.72 

0.51' 


1.42 

1.70 

1.56 

1.34 

1.21 

0.97 

AT = 

1.21 

1.56 

1.70 

1.51 

1.42 

1.21 

0.87 

1.34 

1.51 

1.70 

1.59 

1.48 


0.72 

1.21 

1.42 

1.59 

1.70 

1.56 


0.51 

0.97 

1.21 

1.48 

1.56 

1.70 

From 0 we also have A'** 

= 1.70 

and 




AT* = 

[0.38 

0.79 

1.03 

1.35 

1.46 

1.58 


2. From 0 and 0, y„ = 0.95 and var(y*) = 0.21. 

3. Figure [I] shows a data point with a question mark underneath, representing the 
estimation of the dependent variable at x * = 0.2. 

We can repeat the above procedure for various other points spread over some portion 
of the x axis, as shown in Figure[2] (In fact, equivalently, we could avoid the repetition 
by performing the above procedure once with suitably larger AT* and AT** matrices. 
In this case, since there are 1,000 test points spread over the x axis, A'** would be 
of size 1,000 x 1,000.) Rather than plotting simple error bars, we’ve decided to plot 
y* ± 1.96-\/var(y*), giving a 95% confidence interval. 
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Figure 2: The solid line indicates an estimation of y, for 1,000 values of :c*. Pointwise 
95% confidence intervals are shaded. 


4 GPR in the Real World 


The reliability of our regression is dependent on how well we select the covariance 
function. Clearly if its parameters — call them 9 = {l,Of,a n } — are not cho¬ 
sen sensibly, the result is nonsense. Our maximum a posteriori estimate of 9 occurs 
when p(9 |x, y) is at its greatest. Bayes’ theorem tells us that, assuming we have little 
prior knowledge about what 9 should be, this corresponds to maximizing logp(y |x, 9), 
given by 

l°gp(y|x, 9) = -^y T A" _1 y — ^ log |AT| - |log27r. (10) 

Simply run your favourite multivariate optimization algorithm (e.g. conjugate gradi¬ 
ents, Nelder-Mead simplex, etc.) on this equation and you’ve found a pretty good 
choice for 9\ in our example, l = 1 and < 7 / = 1.27. 

It’s only “pretty good” because, of course, Thomas Bayes is rolling in his grave. 
Why commend just one answer for 9 , when you can integrate everything over the many 
different possible choices for 91 Chapter 5 of Rasmussen and Williams (2006) presents 
the equations necessary in this case. 

Finally, if you feel you’ve grasped the toy problem in Figure[2] the next two exam¬ 
ples handle more complicated cases. Figure[3ja), in addition to a long-term downward 
trend, has some fluctuations, so we might use a more sophisticated covariance function: 


k(x, x') = cry j 


— (x — x') 2 


2l\ 


a f2 exp 


~{x^x') 2 


211 


+ a 2 J(x,x'). (11) 


The first term takes into account the small vicissitudes of the dependent variable, and 
the second term has a longer length parameter (l 2 ~ GI \) to represent its long-term 
trend. Covariance functions can be grown in this way ad infinitum, to suit the complex¬ 
ity of your particular data. 
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Figure 3: Estimation of y * (solid line) for a function with (a) short-term and long-term 
dynamics, and (b) long-term dynamics and aperiodic element. Observations are shown 
as crosses. 


The function looks as if it might contain a periodic element, but it’s difficult to be 
sure. Let’s consider another function, which we’re told has a periodic element. The 
solid line in Figure[3jb) was regressed with the following covariance function: 


k(x,x') 


a 2 exp 


— (x — x') 2 
2l 2 


+ exp{— 2 sin 2 [^7r(a; — a/)]} + cr 2 5(a;, x'). 


( 12 ) 


The first term represents the hill-like trend over the long term, and the second term 
gives periodicity with frequency v. This is the first time we’ve encountered a case 
where x and x’ can be distant and yet still ‘see’ each other (that is, k(x, x') 76 0 for 
x » x'). 

What if the dependent variable has other dynamics which, a priori, you expect to 
appear? There’s no limit to how complicated k(x,x r ) can be, provided K is positive 
definite. Chapter 4 of Rasmussen and Williams| ( |2006| > offers a good outline of the 
range of covariance functions you should keep in your toolkit. 

“Hang on a minute,” you ask, “isn’t choosing a covariance function from a toolkit 
a lot like choosing a model type, such as linear versus cubic — which we discussed 
at the outset?” Well, there are indeed similarities. In fact, there is no way to perform 
regression without imposing at least a modicum of structure on the data set; such is the 
nature of generative modelling. However, it’s worth repeating that Gaussian processes 
do allow the data to speak very clearly. For example, there exists excellent theoreti¬ 
cal justification for the use of (jTj) in many settings (Rasmussen and Williams ( 2006j >, 
Section 4.3). You will still want to investigate carefully which covariance functions 
are appropriate for your data set. Essentially, choosing among alternative functions is 
a way of reflecting various forms of prior knowledge about the physical process under 
investigation. 
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5 Discussion 


We’ve presented a brief outline of the mathematics of GPR, but practical implementa¬ 
tion of the above ideas requires the solution of a few algorithmic hurdles as opposed 
to those of data analysis. If you aren’t a good computer programmer, then the code 
for Figures [l]and|2]is at github.com/mebden/GPtutorial, and more general code can be 
found at www.gaussianprocess.org/gpml. 

We’ve merely scratched the surface of a powerful technique ( MacKay] 19981. First, 
although the focus has been on one-dimensional inputs, it’s simple to accept those of 
higher dimension. Whereas x would then change from a scalar to a vector, k(x, x r ) 
would remain a scalar and so the maths overall would be virtually unchanged. Second, 
the zero vector representing the mean of the multivariate Gaussian distribution in ([6]) 
can be replaced with functions of x. Third, in addition to their use in regression, GPs 
are applicable to integration, global optimization, mixture-of-experts models, unsuper¬ 
vised learning models, and more — see Chapter 9 of Rasmussen and Williams ( 2006j >. 
The next tutorial will focus on their use in classification. 


References 

MacKay, D. (1998). In C.M. Bishop (Ed.), Neural networks and machine learning. 
(NATO ASI Series, Series F, Computer and Systems Sciences, Vol. 168, pp. 133- 
166.) Dordrecht: Kluwer Academic Press. 

Rasmussen, C. and C. Williams (2006). Gaussian Processes for Machine Learning. 
MIT Press. 

Sivia, D. and J. Skilling (2006). Data Analysis: A Bayesian Tutorial (second ed.). 
Oxford Science Publications. 


Appendix 


Imagine a data sample d taken from some multivariate Gaussian distribution with zero 
mean and a covariance given by matrix D. Now decompose d arbitrarily into two 
consecutive subvectors a and b — in other words, writing d ~ Af{ 0, D) would be the 
same as writing 




(13) 


where A, B, and C are the corresponding bits and pieces that make up D. 

Interestingly, the conditional distribution of b given a is itself Gaussian-distributed. 
If the covariance matrix D were diagonal or even block diagonal, then knowing a 
wouldn’t tell us anything about b: specifically, b|a ~ Af(0, B). On the other hand, if 
C were nonzero, then some matrix algebra leads us to 


b|a ~ Af(CA~ 1 a, B-CA- 1 ^). 


(14) 


The mean, Cb4 -1 a, is known as the ‘matrix of regression coefficients’, and the vari¬ 
ance, B — CA~ 1 C T , is the ‘Schur complement of A in D'. 

In summary, if we know some of d, we can use that to inform our estimate of what 
the rest of d might be, thanks to the revealing off-diagonal elements of D. 
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Gaussian Processes for Classification: 

A Quick Introduction 

M. Ebden, August 2008 

Prerequisite reading: Gaussian Processes for Regression 


1 Overview 


As mentioned in the previous document, GPs can be applied to problems other than 
regression. For example, if the output of a GP is squashed onto the range [0,1], it can 
represent the probability of a data point belonging to one of say two types, and voila, 
we can ascertain classifications. This is the subject of the current document. 

The big difference between GPR and GPC is how the output data, y, are linked to 
the underlying function outputs, f. They are no longer connected simply via a noise 
process as in 0 in the previous document, but are instead now discrete: say y = 1 
precisely for one class and y = — 1 for the other. In principle, we could try fitting a 
GP that produces an output of approximately 1 for some values of x and approximately 
— 1 for others, simulating this discretization. Instead, we interpose the GP between the 
data and a squashing function; then, classification of a new data point a;* involves two 
steps instead of one: 

1. Evaluate a ‘latent function’ / which models qualitatively how the likelihood of 
one class versus the other changes over the x axis. This is the GP. 

2. Squash the output of this latent function onto [0,1] using any sigmoidal function, 
tt(/) =prob(y = 1|/). 

Writing these two steps schematically. 


data, x * 

GP 
—> 

latent function, /* \x* 

sigmoid^ 

class probability, 7r(/*) 


The next section will walk you through more slowly how such a classifier operates. 
Section [3] explains how to train the classifier, so perhaps we’re presenting things in 
reverse order! Section [4]handles classification when there are more than two classes. 

Before we get started, a quick note on n(f). Although other forms will do, here 
we will prescribe it to be the cumulative Gaussian distribution, <!>('/). This S'-shaped 
function satisfies our needs, mapping high / into 7r(/) « 1, and low / into 7r(/) « 0. 

A second quick note, revisiting ([6]) and ([7J in the first document: confirm for your¬ 
self that, if there were no noise (a n = 0), the two equations could be rewritten as 


and 





/*|f ~ JfiKfK-'f, - KJi-^Kj). 


(1) 

( 2 ) 
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2 Using the Classifier 

Suppose we’ve trained a classifier from n input data, x, and their corresponding expert- 
labelled output data, y. And suppose that in the process we formed some GP outputs 
f corresponding to these data, which have some uncertainty but mean values given by 
f. We’re now ready to input a new data point, a;*, in the left side of our schematic, in 
order to determine at the other end the probability 7 r* of its class membership. 

In the first step, finding the probability p(/* |f) is similar to GPR, i.e. we adapt j2j: 

p(/*|f) = K» - K^KT'Kj). (3) 

( K' will be explained soon, but for now consider it to be very similar to K.) In the 
second step, we squash /* to find the probability of class membership, 7 r* = 7 r(/*) = 
<&(/*). The expected value is 


7T * 


= j 7r(/*)p(/*|f)d/*- 


(4) 


This is the integral of a cumulative Gaussian times a Gaussian, which can be solved 
analytically. By Section 3.9 of Rasmussen and Williams (2006), the solution is: 





/* \ 

+ var (/*)/ 


An example is depicted in Figure [T] 


(5) 


3 Training the GP in the Classifier 


Our objective now is to find f and K', so that we know everything about the GP pro¬ 
ducing ([3]), the first step of the classifier. The second step of the classifier does not 
require training as it’s a fixed sigmoidal function. 

Among the many GPs which could be partnered with our data set, naturally we’d 
like to compare their usefulness quantitatively. Considering the outputs f of a certain 
GP, how likely they are to be appropriate for the training data can be decomposed using 
Bayes’ theorem: 


p(f|x,y) 


p(y|f)p(f|x) 

p(y|x) 


( 6 ) 


Let’s focus on the two factors in the numerator. Assuming the data set is i.i.d., 


p( y| f ) = n^l/0- (7) 

2 — 1 


Dropping the subscripts in the product, p(y\f) is informed by our sigmoid function, 
7 r(/). Specifically, p{y = 1|/) is 7 r(/) by definition, and to complete the picture, 
p(y = —1|/) = 1 — 7r(/). A terse way of combining these two cases is to write 

p(y\f) = Hvf)- 

The second factor in the numerator is p(f|x). This is related to the output of the 
first step of our schematic drawing, but first we’re interested in the value of p(f |x) 
which maximizes the posterior probability p(f |x, y). This occurs when the derivative 
of (|6]l with respect to f is zero, or equivalently and more simply, when the derivative 
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Figure 1: (a) Toy classification dataset, where circles and crosses indicate class mem¬ 
bership of the input (training) data y, at locations x. y is 1 for one class and —1 for 
another, but for illustrative purposes we pretend y = 0 instead of —1 in this figure. 
The solid line is the (mean) probability 7f* = prob(y* = l|x*), i.e. the ‘answer’ to our 
problem after successfully performing GPC. (b) The corresponding distribution of the 
latent function /, not constrained to lie between 0 and 1. 


of its logarithm is zero. Doing this, and using the same logic that produced ( [T()| in the 
previous document, we find that 

f = A'Vlogp(y|f), (8) 

where f is the best f for our problem. Unfortunately, f appears on both sides of the 
equation, so we make an initial guess (zero is fine) and go through a few iterations. The 
answer to ([8]) can be used directly in (|3j, so we’ve found one of the two quantities we 
seek therein. 

The variance of f is given by the negative second derivative of the logarithm of |[6}, 
which turns out to be (K~ x + W)~ x , with W = — VV logp(y|f). Making a Laplace 
approximation, we pretend p(f |x, y) is Gaussian distributed, i.e. 

P(f |x, y) ~ <?(f |x, y) = AT(f, (A" -1 + fU)" 1 ). (9) 

(This assumption is occasionally inaccurate, so if it yields poor classifications, bet¬ 
ter ways of characterizing the uncertainty in f should be considered, for example via 
expectation propagation.) 

Now for a subtle point. The fact that f can vary means that using (|2]i directly is in¬ 
appropriate: in particular, its mean is correct but its variance no longer tells the whole 
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story. This is why we use the adapted version, 0. with K' instead of K. Since the 
varying quantity in (pi, f, is being multiplied by K, K ~ 1 , we add K, K ~ 1 cov(f j K~ l I\J 
to the variance in (pJT Simplification leads to 0, in which K' = K + W 1 . 

With the GP now completely specified, we’re ready to use the classifier as described 
in the previous section. 

GPC in the Real World 

As with GPR, the reliability of our classification is dependent on how well we select 
the covariance function in the GP in our first step. The parameters are 6 = 
one fewer now because a n = 0. However, as usual, 0 is optimized by maximizing 
p(y|x, 6 ), or (omitting 6 on the righthand side of the equation), 

p(y|x,0) = J p(y|f)p(f |x)df. (10) 

This can be simplified, using a Laplace approximation, to yield 

p(y|x, 0) = —FK-'f + logp(y|f) - i log(|A'| • \K~ X + W|). (11) 

This is the equation to run your favourite optimizer on, as performed in GPR. 


4 Multi-Class GPC 


We’ve described binary classification, where the number of possible classes, C, is just 
two. In the case of C > 2 classes, one approach is to fit an / for each class. In the first 
of the two steps of classification, our GP values are concatenated as 


{= (f 1 f 1 f 2 f 2 f c f c ) T 


( 12 ) 


Let y be a vector of the same length as f which, for each i = 1,..., n, is 1 for the class 
which is the label and 0 for the other C — 1 entries. Let K grow to being block diagonal 
in the matrices K 1 , ..., I\ c . So the first change we see for C > 2 is a lengthening 
of the GP. Section 3.5 of Rasmussen and Willi ams] ( |2006[ ) offers hints on keeping the 
computations manageable. 

The second change is that the (merely one-dimensional) cumulative Gaussian dis¬ 
tribution is no longer sufficient to describe the squashing function in our classifier; 
instead we use the softmax function. For the ith data point. 




exp(/f) 

E C ' ex p(/f) 


(13) 


where f, is a nonconsecutive subset of f, viz. f, = {/l, ff, ..., ff}. We can summa¬ 
rize our results with 7r = {tt\, ... ,7r^,7r^,.... 7r^,... ,7rp,... ,7rp}. 

Now that we’ve presented the two big changes needed to go from binary- to multi¬ 
class GPC, we continue as before. Setting to zero the derivative of the logarithm of the 
components in ([6]), we replace (|8]i with 

f = K ( y - 7 r ). (14) 


The corresponding variance is (/\ _1 + kF)~ 1 as before, but now W = diag(7r) —nn T , 
where n is a Cn x n matrix obtained by stacking vertically the diagonal matrices 
diag(7r c ), if 7r c is the subvector of tt pertaining to class c. 
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With these quantities estimated, we have enough to generalize ([3]) to 

p(/* c If) = f C - diag (K„) - {K:f{K c + (W c )- 1 )- 1 (i£) T ), (15) 

where /*, A'J, and PF C represent the class-relevant information only. Finally, © is 
replaced with 


n 

p(y |x, e ) = --f T ^- x f + y T f ^ log 


z=l 




C— 1 


--log(\K\.\K~ l +W\). 

(16) 


We won’t present an example of multi-class GPC, but hopefully you get the idea. 


5 Discussion 

As with GPR, classification can be extended to accept x values with multiple dimen¬ 
sions, while keeping most of the mathematics unchanged. Other possible extensions 
include using the expectation propagation method in lieu of the Laplace approximation 
as mentioned previously, putting confidence intervals on the classification probabili¬ 
ties, calculating the derivatives of ( [T6| to aid the optimizer, or using the variational 
Gaussian process classifiers described in |MacKay ( 1998| >, to name but four extensions. 

Second, we repeat the Bayesian call from the previous document to integrate over 
a range of possible covariance function parameters. This should be done regardless 
of how much prior knowledge is available — see for example Chapter 5 of |Sivia and 
Skilling (2006) on how to choose priors in the most opaque situations. 

Third, we’ve again spared you a few practical algorithmic details; computer code 
is available at www.gaussianprocess.org/gpml, with examples. 
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Gaussian Processes for Dimensionality Reduction: 
A Quick Introduction 

M. Ebden, August 2015 

Prerequisite reading: Gaussian processes for regression 


Suppose you wanted to learn a low-dimensional representation of a dataset for ease of 
interpretation, sort of like how your shadow is a simplified, two-dimensional represen¬ 
tation of a three-dimensional object. Let the original data be matrix Y with n rows 
(observations) of d dimensions each, and let the new representation be X with n rows 
of q dimensions each (q < d). 

To learn this low-dimensional representation, we’ll assume that for the ith di¬ 
mension of Y the n elements in y : , are a sample from a Gaussian process based 
on the low-dimensional space. Specifically, we’ll use a zero-mean Gaussian process, 
QV (0„ x i ■ &(x, x')), keeping the same model for each dimension of Y. We’ll choose 
for fc(-) the squared exponential kernel (a.ka. RBF kernel) in order to ensure that points 
close in X will be close in Y. Renaming the noise variance from a 2 (in the first report) 
to 1//3, the kernel is: 


fc(x, x') = a 2 exp 


,'|21 


2/ 2 


S(x, x') 

P ‘ 


The covariance matrix K for this GP is constructed as per (4) in the first report, and 
we have no need for a K* or K, . from (5) because there are no points in Y to predict. 
The tuple of kernel parameters this time is 9 = {a, l, /?}. 

We’ll further assume that the GPs behind each of the d dimensions have been sam¬ 
pled independently. Therefore, the likelihood of the observed values in Y is the product 
of d independent GPs: 


P(Y|X, 9) 


n 


exp[-|y^K V:,*] 

(27r) n / 2 |K| 1 / 2 


Then we can adjust X and 9 to maximize this likelihood. If we were only adjusting 
X, it would be akin to determining the shadow which is most likely to correspond to 
your body’s shape. Because we’re adjusting 6 as well, imagine manipulating the light 
source and the surface for the shadow to appear on as well. 

Figure [T] gives an example outcome of the most likely X for a certain Y, with 
n = 17, d = 2, and q = 1. Y was preprocessed to have zero mean and the same 
variance in each dimension. Using an optimizer based on scaled conjugate gradients, 
the optimal 9 and X were then found. The former is {a, l, 6} = {1.05,3.3 x 10 —4 ,93} 
and the latter is given in Figure[I]b). In this example, X is more than just a ‘shadow’ 
(simple projection) of Y: the above method of dimensionality reduction is powerful 
enough to have learned that the points in the left curve in Figure |Tf a) should be grouped 
together while the points in the right curve form a separate group; no simple lamplight 
projection can achieve this. The price we pay for this flexibility is in the high number 
of parameters being fit: the total here is 3 +17 x 1 = 20, i.e. 9 plus the one-dimensional 
values in X. Usually X are referred to as latent variables rather than parameters, and 
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our overall setup is called the ‘Gaussian process latent variable model’ (GP-LVM). The 
technique was first presented by Neil Lawrence in 2004; he examined a set of oil-flow 
data in which n = 1000 and d = 12, which we reduce to q = 2 in Figure[2] 

This tutorial on the GP-LVM is provided by Mind Foundry, a technology spin-out 
from the University of Oxford. 
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Figure 1: (a) Seventeen data points, Y, each of two dimensions. 

(b) A one-dimensional projection, X, of those points. 



Figure 2: Although the GP-LVM algorithm uses unsupervised learning, the three 
classes of data in this originally 12-D dataset generally occupy different regions. 
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